Information processing apparatus and control method therefor

ABSTRACT

An information processing apparatus includes an image input unit which inputs image data containing a face, a face position detection unit which detects, from the image data, the position of a specific part of the face, and a facial expression recognition unit which detects a feature point of the face from the image data on the basis of the detected position of the specific part and determines facial expression of the face on the basis of the detected feature point. The feature point is detected at a detection accuracy higher than that of the detection of the position of the specific part, whereas detection of the position of the specific part is robust to variations in the detection target.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of application Ser. No. 11/532,979, filed Sep. 19, 2006, the entire disclosure of which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an information processing apparatus and a control method therefor, and particularly to an image recognition technique.

2. Description of the Related Art

Conventionally, an object recognition (image recognition) technique is known which causes an image sensing device to sense an object to acquire image data and calculates the position and orientation of the object by analyzing the image data.

Japanese Patent Laid-Open No. 09-282454 discloses the following object recognition technique. First, low-resolution object recognition processing is executed to coarsely obtain the position and orientation of a whole recognition target object (recognition processing of the first phase). A local recognition range is set around a characteristic part on the object on the basis of the recognition result. High-resolution object recognition processing is partially executed for only the local recognition range (recognition processing of the second phase). The characteristic part on the object includes, e.g., a hole for a screw or rod, a projection for assembly, and a mark on the object surface. The position and orientation of the entire target object are calculated on the basis of the object recognition result in the local recognition range.

However, the arrangement disclosed in Japanese Patent Laid-Open No. 09-282454 requires a predetermined time between the recognition processing of the first phase and the recognition processing of the second phase. For this reason, it is difficult to accurately recognize an object in an environment where the image sensing conditions dynamically change due to, e.g., variations in illumination conditions, variations in size and shape of the recognition target object, and rotation of the recognition target object.

Hence, when the recognition target object is a human face whose facial expression at a given point of time must be recognized, the conventional technique mentioned above cannot be used.

On the other hand, there is another conventional technique which analyzes the image data of a sensed face image and recognizes the eye region of the recognition target in the sensed image on the basis of the analysis result.

Japanese Patent No. 3452685 discloses a face image processing technique. In this technique, only low luminance values are extracted from a face image by using a low-luminance-value extraction filter and are binarized. The barycenter of the binary image is calculated and set as the barycentric position of the face. An eye existence region is set on the basis of the barycentric position, and at least one eye existence candidate region is set within that region. The eye region is determined from the candidate regions.

The face image processing technique disclosed in Japanese Patent No. 3452685 is implemented to process an image which contains only a face. Hence, if a background is present in the image, the face barycentric position may be recognized as a position far from the true position. In this case, the eye region cannot be set correctly. When setting a region by the technique disclosed in Japanese Patent No. 3452685, the distance between the camera and the object is measured in advance, and the eye region is set on the basis of the measured distance, independent of the size of the face of the object. For this reason, correct region setting may be impossible for an arbitrary face size. Correct region setting may also be impossible when a variation such as rotation occurs.

SUMMARY OF THE INVENTION

The present invention has been made in consideration of the above-described problems, and has as its object to provide a technique of accurately recognizing an object even in an environment where image sensing conditions dynamically change. It is another object of the present invention to provide a technique of accurately recognizing a face under various image sensing conditions.

In order to achieve the above object, an information processing apparatus according to the present invention has the following arrangement. The information processing apparatus comprises:

an input unit adapted to input image data containing a face;

a first detection unit adapted to detect, from the image data, a position of a specific part of the face;

a second detection unit adapted to detect a feature point of the face from the image data on the basis of the detected position of the specific part; and

a determination unit adapted to determine facial expression of the face on the basis of the detected feature point,

wherein the second detection unit has higher detection accuracy than the detection accuracy of the first detection unit, and the first detection unit is robust to a variation in a detection target.

In order to achieve the above object, a control method for an information processing apparatus according to the present invention has the following arrangement. The control method for an information processing apparatus for processing image data containing a face comprises the steps of:

inputting image data containing a face;

detecting, from the image data, a position of a specific part of the face (a first detection step);

detecting a feature point of the face from the image data on the basis of the detected position of the specific part (a second detection step); and

determining facial expression of the face on the basis of the detected feature point,

wherein the second detection step has higher detection accuracy than the detection accuracy of the first detection step, and the first detection step is robust to a variation in a detection target.

Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a block diagram showing the functional arrangement of an information processing apparatus according to the first embodiment;

FIG. 2 is a schematic view showing a neural network;

FIG. 3 is a view schematically showing histogram correction processing;

FIG. 4 is a view showing the connection relationship between a neuron of a given layer feature and a plurality of neurons of the preceding layer feature;

FIG. 5 is a view showing the connection relationship to preceding layer neurons necessary for calculating adjacent neurons of a given layer feature;

FIG. 6 is a block diagram showing the detailed functional arrangement of a facial expression recognition unit;

FIG. 7 is a view showing the arrangement of a CNN to extract feature points;

FIG. 8 is a schematic view showing feature points to be extracted;

FIG. 9 is a schematic view showing face, left/right eye, and mouth barycentric positions obtained from the CNN to detect a face position;

FIG. 10 is a schematic view showing a nostril barycentric position calculation range to obtain a nostril position, a product-sum operation range necessary for obtaining the barycenter calculation range, and an input image range necessary for obtaining the barycenter calculation range;

FIG. 11 is a schematic view showing the left and right nostril positions and the subnasal edge;

FIGS. 12A, 12B, and 12C are schematic views showing receptive fields necessary for calculating the barycenter of the left and right nostril positions, the barycenter of the right nostril position, and the barycenter of the subnasal edge;

FIG. 13 is a schematic view showing a barycenter calculation range to obtain left and right eyebrow end feature points;

FIG. 14 is a schematic view showing a barycenter calculation range to obtain left and right eyebrow median feature points;

FIG. 15 is a schematic view showing a barycenter calculation range to obtain left and right eye end feature points;

FIG. 16 is a schematic view showing a barycenter calculation range to obtain the feature points of the upper and lower edges of the left and right eyes;

FIG. 17 is a schematic view showing a barycenter calculation range to obtain a mouth end feature point;

FIG. 18 is a schematic view showing a barycenter calculation range to obtain the feature points of the upper and lower edges of the mouth;

FIG. 19 is a view showing forehead, glabella, and cheek regions;

FIG. 20 is a view showing a minimum input image region necessary for obtaining all feature points;

FIG. 21 is a view showing the barycentric positions of the left and right eye regions and face region used to detect size variation and rotational variation;

FIG. 22 is a view showing the barycentric positions of the left and right eye regions and face region when size variation has occurred;

FIG. 23 is a view showing the barycentric positions of the left and right eye regions and face region when horizontal rotational variation has occurred;

FIG. 24 is a schematic view showing the barycentric positions of the left and right eye regions and face region when vertical rotational variation has occurred;

FIG. 25 is a schematic view showing the arrangement of a CNN to determine facial expression;

FIG. 26 is a table showing the weights of feature amount variations in calculating scores from the feature amount variations to determine facial expression “joy”;

FIG. 27 is a graph showing the distribution of scores calculated from the feature amount variations;

FIG. 28 is a graph showing a score distribution template prepared in advance for facial expression “joy”;

FIG. 29 is a flowchart showing the procedure of overall processing according to the first embodiment;

FIG. 30 is a block diagram showing the functional arrangement of an information processing apparatus according to the second embodiment;

FIG. 31 is a block diagram showing the functional arrangement of a facial expression recognition unit;

FIG. 32 is a schematic view showing a vector that has the initial point at the face detection position and the end point at the right lateral canthus feature point in t [frame] and t+1 [frame] images;

FIG. 33 is a schematic view showing calculation of a motion vector;

FIG. 34 is a view showing the intercanthal distance and the horizontal and vertical components of the vector that has the initial point at the face detection position and the end point at the right lateral canthus feature point;

FIG. 35 is a view showing the intercanthal distance and the horizontal and vertical components of the vector that has the initial point at the face detection position and the end point at the right lateral canthus feature point when size variation has occurred;

FIG. 36 is a flowchart showing the procedure of overall processing according to the second embodiment;

FIG. 37 is a block diagram showing the functional arrangement of an information processing apparatus according to the third embodiment;

FIG. 38 is a flowchart showing the procedure of overall processing according to the third embodiment;

FIG. 39 is a block diagram schematically showing the hardware configuration of the information processing apparatuses according to the first to third embodiments;

FIG. 40 is a view showing the contents of a table 113; and

FIG. 41 is a view showing the contents of a table 313.

DESCRIPTION OF THE EMBODIMENTS

The embodiments of the present invention will now be described in detail in accordance with the accompanying drawings. Note that each element in the following embodiments is not intended to limit the scope of the invention, but is merely an example.

First Embodiment

[Hardware Configuration of Information Processing Apparatus]

The hardware configuration of an information processing apparatus according to this embodiment will be described first with reference to FIG. 39. FIG. 39 is a block diagram schematically showing the hardware configuration of the information processing apparatus of this embodiment. The information processing apparatus according to this embodiment is implemented by, e.g., a personal computer (PC), workstation (WS), or personal digital assistant (PDA).

Referring to FIG. 39, a CPU 390 executes application programs, an operating system (OS), and control programs stored in a hard disk (to be referred to as an HD hereinafter) 395 (to be described later). The CPU 390 also temporarily stores, in a RAM 392, information and files necessary for program execution.

A ROM 391 stores programs including a basic I/O program and various kinds of data such as font data and template data used in document processing. The RAM 392 temporarily stores various kinds of data and functions as the main memory and work area of the CPU 390.

An external storage drive 393 that implements access to a recording medium can load, e.g., a program from a medium (recording medium) 394 to the computer system. The medium 394 may be an arbitrary medium such as a flexible disk (FD), CD-ROM, CD-R, CD-RW, PC card, DVD, IC memory card, MO, or memory stick.

In this embodiment, the external storage device 395 comprises an HD that functions as a mass storage device. The HD 395 stores application programs, the OS, control programs, and related programs.

An instruction input device 396 is implemented by a device such as a keyboard, pointing device (e.g., mouse), or touch panel. The user inputs, e.g., a command to control the information processing apparatus of this embodiment by using the instruction input device 396.

A display 397 displays a command input from the instruction input device 396 or a response output of the information processing apparatus to the command.

A system bus 399 manages the data flow in the information processing apparatus.

An image sensing device 398 senses an object and acquires image data. The image sensing device 398 comprises components such as an imaging optical system, a solid-state image sensing element, and a video signal processing circuit to execute A/D conversion and the like. The image sensing device 398 acquires digital image data by A/D-converting an electrical signal obtained from a CCD or CMOS sensor serving as a solid-state image sensing element. The image data acquired by the image sensing device 398 is subjected to buffering processing under the control of the CPU 390 and transferred to a memory such as the RAM 392 by DMA.

Software that implements the same functions as the above-described hardware devices may be used instead.

In an example of this embodiment, programs and related data according to the embodiment are directly loaded from the medium 394 to the RAM 392 and executed. The programs of this embodiment may be installed in the HD 395 in advance and loaded from there to the RAM 392 every time the programs of this embodiment run. Alternatively, the programs of this embodiment may be recorded in the ROM 391 as part of the memory map and directly executed by the CPU 390.

The information processing apparatus of this embodiment is implemented by a single apparatus for descriptive convenience. However, the resources may be distributed to a plurality of apparatuses. For example, the storage and operation resources may be distributed to a plurality of apparatuses. The resources may also be distributed to virtual constituent elements on the information processing apparatus to perform parallel processing.

[Functional Arrangement of Information Processing Apparatus]

The functional arrangement for object recognition by the above-described information processing apparatus will be described next with reference to FIG. 1. FIG. 1 is a block diagram showing the functional arrangement of the information processing apparatus according to this embodiment.

The functional blocks shown in FIG. 1 are implemented when the CPU 390 of the information processing apparatus described above with reference to FIG. 39 executes programs loaded into the RAM 392, in cooperation with the hardware shown in FIG. 1. Some or all of the functional blocks may be implemented by dedicated hardware.

Referring to FIG. 1, an image input unit 100 senses an object and acquires image data. The image input unit 100 corresponds to the image sensing device 398 in FIG. 39. The image input unit 100 acquires image data and buffers it in a memory such as the RAM 392.

In this embodiment, the image data input by the image input unit 100 is data of a face image. In this embodiment, the image data is data of a moving image containing a plurality of frames.

A face position detection unit 101 specifies the position of a face, i.e., an object as a position and orientation calculation target. The face position detection unit 101 specifies the face position by using a multilayered neural network (first CNN) that is schematically shown in FIG. 2. FIG. 2 is a schematic view of the neural network.

In this embodiment, a face position in a digital image is specified particularly by using a convolutional neural network (to be referred to as a CNN hereinafter) as the neural network. The CNN is a known technique disclosed in, e.g., M. Matsugu, K. Mori, M. Ishii, and Y. Mitarai, “Convolutional Spiking Neural Network Model for Robust Face Detection”, 9th International Conference on Neural Information Processing, pp. 660-664, November 2002. The CNN is implemented by cooperation of hardware and programs in the information processing apparatus of this embodiment. The operation of the face position detection unit 101 will be described later in detail.

A facial expression recognition unit 102 has an arrangement shown in FIG. 6. FIG. 6 is a block diagram showing the detailed functional arrangement of the facial expression recognition unit 102. As shown in FIG. 6, the facial expression recognition unit 102 comprises a predetermined feature amount extraction unit 110, a feature amount variation calculation unit 111, and a facial expression determination unit 112. The facial expression determination unit 112 causes neurons to learn facial expression determination by looking up a table 113 containing the correspondence between feature amounts and facial expressions.

The arrangement of this embodiment uses two networks: a CNN (first CNN) to make the face position detection unit 101 detect a face position on the basis of an image and a CNN (second CNN) to make the facial expression recognition unit 102 obtain feature points necessary for recognizing facial expression.

The predetermined feature amount extraction unit 110 extracts predetermined feature amounts necessary for recognizing facial expression on the basis of the image sensing target's face position detected by the face position detection unit 101. The feature amount variation calculation unit 111 normalizes feature amount variations in accordance with variations in the feature amounts extracted by the predetermined feature amount extraction unit 110. In this normalization, the positions of feature points are corrected on the basis of their layout in the image data. The facial expression determination unit 112 determines the facial expression on the basis of the feature amount variations normalized by the feature amount variation calculation unit 111. The predetermined feature amount extraction unit 110, feature amount variation calculation unit 111, and facial expression determination unit 112 included in the facial expression recognition unit 102 will be described later in detail.

[Overall Processing]

Overall processing executed by the arrangement of this embodiment will be described next with reference to FIG. 29. FIG. 29 is a flowchart showing the procedure of overall processing according to this embodiment.

In step S270, the face position detection unit 101 executes decimation and histogram correction of the image data acquired by the image input unit 100. The image resolution after decimation is, e.g., 360×240 [pixels].

In step S271, the face position detection unit 101 determines a face position in the image by using the CNN. The resolution of the input image to the CNN to determine a face position is further reduced to, e.g., 180×120 [pixels] by decimation.

In step S272, the facial expression recognition unit 102 determines whether a face is detected. If a face is detected (YES in step S272), the process advances to step S273. If no face is detected (NO in step S272), the process returns to step S270 to execute the same processing for the image data of the next frame.

In step S273, the predetermined feature amount extraction unit 110 sets a nostril feature point extraction range by using the face and eye positions extracted by the first CNN for face position detection.

In step S274, the predetermined feature amount extraction unit 110 extracts a nostril feature point on the basis of the extraction range set in step S273.

In step S275, the predetermined feature amount extraction unit 110 sets the feature point extraction ranges other than the nostril feature point range by using the eye and mouth positions acquired by the CNN to determine the face position and the nostril feature point position extracted in step S274.

In step S276, the predetermined feature amount extraction unit 110 extracts feature points by using the second CNN on the basis of the extraction ranges set in step S275. The resolution of the input image to the second CNN to extract feature points is, e.g., 360×240 [pixels].

In step S277, the predetermined feature amount extraction unit 110 determines whether all feature points are extracted by the processing in steps S273 to S276. If all feature points are extracted (YES in step S277), the process advances to step S278. If not all feature points are extracted (NO in step S277), the process returns to step S270 to execute the same processing for the next frame.

In step S278, the feature amount variation calculation unit 111 calculates feature amount variations by comparison with an expressionless reference face prepared in advance and normalizes them in accordance with variations. That is, the positions of the feature points are corrected on the basis of their layout in the image data. The data of the expressionless reference face is stored in a storage device such as the HD 395 in advance.

In step S279, the facial expression determination unit 112 determines facial expression by using an NN (neural network) for facial expression determination.
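
The steps above amount to a per-frame loop. The following Python sketch illustrates only that control flow; the `units` bundle of callables is a hypothetical interface introduced for illustration and is not part of the disclosed apparatus.

def recognize_expression(frame, units):
    """Illustrative control flow of FIG. 29; every attribute of `units`
    is a hypothetical callable standing in for a unit described above."""
    image = units.preprocess(frame)                        # S270: decimation + histogram correction
    face = units.detect_face(image)                        # S271: first CNN
    if face is None:                                       # S272: no face -> next frame
        return None
    nostril = units.extract_nostril(image, face)           # S273-S274: nostril feature point
    points = units.extract_points(image, face, nostril)    # S275-S276: second CNN
    if points is None:                                     # S277: incomplete -> next frame
        return None
    variations = units.normalize(units.variations(points)) # S278: vs. expressionless reference
    return units.determine(variations)                     # S279: facial expression NN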

Processing in each step will be described below in detail by explaining the processing in each functional arrangement.

[Face Position Detection Unit 101]

The function of the face position detection unit 101 will be described in detail. The face position detection unit 101 detects the position (face position) of a specific part of a face in image data on the basis of the outline of the face.

The face position detection unit 101 acquires the image data stored in the buffer by the image input unit 100 and performs, as preprocessing, resolution change by decimation and histogram correction to reduce the influence of illumination conditions. The face position detection unit 101 inputs the corrected image data to the CNN.

As described above, image data acquired by the image input unit 100 is temporarily stored in the buffer. The face position detection unit 101 reads out the image data from the buffer every other pixel by decimation. For example, if the resolution of the buffered image data is 720×480 [pixels], the face position detection unit 101 acquires image data with a resolution of 360×240 [pixels] by decimation.
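
The every-other-pixel readout can be expressed in a couple of lines. This is a minimal sketch assuming the buffered frame is held as a NumPy array; the function name is illustrative.

import numpy as np

def decimate(image: np.ndarray) -> np.ndarray:
    """Keep every other pixel in both directions, e.g. 720x480 -> 360x240."""
    return image[::2, ::2]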

Next, histogram correction to be described below is executed. A luminance value histogram 130 of the input image is created, as shown in FIG. 3. FIG. 3 is a view schematically showing histogram correction. The luminance value histogram 130 indicates the distribution of the luminance values of the pixels of the input image (image data). The abscissa represents the luminance value, and the ordinate represents the number of pixels (degree).

Luminance values X 131 and Y 132 (maximum and minimum luminance values) at the ends of the curve are extracted from the luminance value histogram. The luminance values are converted by using a nonlinear function 133 such that the extracted luminance values 131 and 132 at the ends of the curve become, e.g., 255 and 0, respectively. A function that reduces the influence of illumination conditions such as shade, i.e., enhances the tone of a low-luminance region, is selected and set in the information processing apparatus in advance as the nonlinear function.

When the luminance is corrected to enhance the tone of the low-luminance region in the above-described way, image recognition can be done accurately, independently of the image sensing conditions.

Histogram correction may be done by any other method. For example, upper and lower limit luminance values are set in advance. Pixels with luminance values smaller than the lower limit value are converted into a luminance value “0”. Pixels with luminance values equal to or larger than the upper limit value are converted into a luminance value “255”. Pixels with luminance values between the lower and upper limit values are converted appropriately with reference to those two limit values. This conversion method can also be applied.
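
As one concrete reading of the above, the sketch below stretches the histogram between its curve ends and then applies a nonlinear (gamma) curve that lifts low-luminance tones. The percentile-based end estimation and the gamma value are assumptions made for illustration; the text only requires some nonlinear function with this behavior.

import numpy as np

def histogram_correct(image: np.ndarray, gamma: float = 0.6) -> np.ndarray:
    """Map the histogram curve ends (Y, X) to (0, 255), then apply a
    gamma < 1 curve that enhances the tone of low-luminance regions."""
    lo, hi = np.percentile(image, (0.5, 99.5))   # assumed estimates of the curve ends
    stretched = np.clip((image.astype(np.float64) - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
    return (255.0 * stretched ** gamma).astype(np.uint8)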

Each layer feature of the CNN includes a number of neurons. In this embodiment, one neuron output represents the feature detection result of one pixel of image data. For example, consider a case wherein only one preceding layer feature is connected to a given layer feature (the sub-sampling layer or feature pooling layer of the CNN). In this case, the internal state value of one neuron 121 of a layer can be obtained by the product-sum operation of a plurality of neurons 120 of the preceding layer feature and the weighting factor data corresponding to them, as shown in FIG. 4. FIG. 4 is a view showing the connection relationship between a neuron of a given layer feature and a plurality of neurons of the preceding layer feature.

The number of neurons of the preceding layer to which one neuron is connected changes depending on the receptive field size of each feature, which is decided according to the specific feature to be extracted. For example, if the receptive field size necessary for obtaining a certain feature is 3×5, an internal state value 124 of one neuron is calculated by the product-sum operation of 3×5 neuron values 122 and 3×5 weighting factors in the preceding layer, as shown in FIG. 5. FIG. 5 is a view showing the connection relationship to preceding layer neurons necessary for calculating adjacent neurons of a given layer feature.

A neuron value 125 immediately adjacent to the neuron internal state value 124 can be calculated by the product-sum operation of the weighting factors and a plurality of neurons 123 of a region that is shifted from the plurality of neurons 122 by one pixel in the preceding layer. That is, a convolutional operation is executed by vertically and horizontally shifting, by one pixel at a time, a region called a receptive field in the preceding layer and repeating the product-sum operation of a weighting factor data set and the plurality of neuron values located in each receptive field. With this processing, the internal state values of all neurons in the current layer can be obtained. If a plurality of preceding layer features are connected to a given layer feature (the feature detection layers of the CNN), as shown in FIG. 2, the sum of the internal state values obtained from the connected preceding layer features is equivalent to the internal state value of one neuron.
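
The sliding product-sum operation described above is an ordinary 2-D convolution. The following is a minimal sketch of one layer-feature response under that reading; summing the results over several preceding features, as in the feature detection layers, is indicated at the end.

import numpy as np

def layer_response(prev: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Internal state map of one layer feature: slide the receptive field
    (weights.shape, e.g. 3x5) over the preceding feature one pixel at a
    time and take the product-sum at each position."""
    rh, rw = weights.shape
    h = prev.shape[0] - rh + 1
    w = prev.shape[1] - rw + 1
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(prev[y:y + rh, x:x + rw] * weights)
    return out

# When several preceding features feed one feature, the internal state is
# the sum of the per-feature product-sum results:
# state = sum(layer_response(f, w) for f, w in zip(features, weight_sets))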

The weighting factor data is obtained by learning using supervisory data given in advance. CNNs (layer features) having various characteristics can be created in accordance with the supervisory data. For example, when learning is done by giving various variations such as illumination variation, size variation, and rotational variation to the supervisory data group of the CNN to detect a face position, the position detection accuracy degrades as compared to a case wherein learning is done by giving only a specific variation such as only illumination variation. Instead, a face detection CNN (layer feature) robust to these variations can be created. Alternatively, a layer feature capable of accurately detecting, e.g., only a V-shaped eye end position can be created by giving only data of V-shaped eye end points as the supervisory data group.

Each layer of the CNN according to this embodiment will be described. The resolution of the input image to the input layer shown in FIG. 2, i.e., the image data input to the CNN that specifies a face position in the image data, is lowered to 180×120 [pixels] by decimation to reduce the processing load.

The CNN of this embodiment has three layers, as shown in FIG. 2. The first layer level (first layer 201) extracts a total of four features: oblique (diagonal-right-up and diagonal-right-down) edges, a horizontal edge, and a vertical edge to recognize the outline of a face. The second layer level (second layer 202) extracts eye and mouth position features.

The third layer level (third layer 203) extracts a face position. The face position includes specific parts defined in advance in a face image, i.e., eye region barycentric positions 160 and 161, a mouth region barycentric position 163, a face region barycentric position 162, and a nostril position (to be described later), as shown in FIG. 9. FIG. 9 is a schematic view showing the face, left/right eye, and mouth barycentric positions obtained from the CNN to detect a face position.

That is, the network arrangement of the CNN according to this embodiment extracts medium-order feature (eyes and mouth) positions by combining a plurality of lower-order feature (edge level) detection results and then extracts a higher-order feature (face position) position from the medium-order feature (eyes and mouth) detection results.

As described above, these features are detected because weighting factors that are learned by using supervisory data in advance are used. The supervisory data used for learning in the CNN to detect a face is generated on the basis of image data of various variations such as size variation, rotational variation, illumination variation, and shape variation. Hence, a robust network capable of detecting face, eye, and mouth positions even in case of the plurality of variations is built.

Image data learning can be done for, e.g., a single object (face) on the basis of images obtained in a changing environment under the following conditions.

(1) The size varies up to three times.

(2) Rotational variation occurs within 45° in the vertical, horizontal, and depth directions.

(3) In-plane rotational variation occurs within 45° in the horizontal direction.

(4) Illumination conditions vary in image sensing under indoor and outdoor illumination environments.

(5) The shapes of the eyes and mouth vary in the vertical and horizontal directions.

The network can be designed to learn such that the peripheral regions of the barycenters of the eyes, mouth, and face are regarded as their correct solution positions. That is, the correct solution positions of the eyes, mouth, and face can be obtained by executing threshold processing of the product-sum operation results of the eye, mouth, and face detection positions and calculating the barycentric positions of local regions equal to or more than the threshold value. The positions of the eyes and mouth are decided only when the face position is decided. That is, in the product-sum operation and threshold processing to detect the eye and mouth positions, candidates for the eye and mouth positions are detected. Only when the face position is decided by the product-sum operation and threshold processing to decide the face position are the eye and mouth positions decided.
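
A minimal sketch of the threshold-and-barycenter step follows. Weighting the barycenter by the neuron values is an assumption made for illustration; the text only requires the barycentric position of the local region at or above the threshold.

import numpy as np

def barycenter_above_threshold(neurons: np.ndarray, thresh: float):
    """Return the (x, y) barycentric position of the neuron outputs that
    are equal to or more than the threshold, or None if there are none."""
    ys, xs = np.nonzero(neurons >= thresh)
    if xs.size == 0:
        return None
    w = neurons[ys, xs]                     # assumed: weight by neuron value
    return float(np.average(xs, weights=w)), float(np.average(ys, weights=w))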

The number of layers, the number of features, and the connection relationship between features of the CNN may be changed. Position information may also be calculated from the neuron values of the eye, mouth, and face features by another method, e.g., one using the maximum neuron value instead of threshold processing and barycenter calculation. The resolution of the image data obtained from the image input unit is not limited to 720×480 [pixels]. The resolution of the input image to the CNN to detect a face position is not limited to 180×120 [pixels], either.

[Predetermined Feature Amount Extraction Unit 110]

The predetermined feature amount extraction unit 110 included in the facial expression recognition unit 102 will be described next. The predetermined feature amount extraction unit 110 sets a region in the image data on the basis of the face position detected by the face position detection unit 101, as will be described later in detail. The predetermined feature amount extraction unit 110 searches for face feature points in the set region and then determines facial expression on the basis of the found feature points.

As described above, the arrangement of this embodiment uses two networks: a CNN (first CNN) to make the face position detection unit 101 detect a face position on the basis of an image and a CNN (second CNN) to make the facial expression recognition unit 102 obtain feature points necessary for recognizing facial expression. The predetermined feature amount extraction unit 110 causes the second CNN to extract the feature points necessary for facial expression recognition on the basis of the input image and the eye, mouth, and face detection positions obtained by the face position detection unit 101. The second CNN to extract the feature points necessary for facial expression recognition has the arrangement shown in FIG. 7. FIG. 7 is a view showing the arrangement of the CNN to extract feature points.

The input image to the second CNN to extract feature points is the histogram-corrected image obtained by the preprocessing of the first CNN that specifies the face position. The image resolution is 360×240 [pixels]. The second CNN to extract feature points processes an input image with a high resolution of 360×240 [pixels] without decimation, unlike the first CNN to detect a face position. This is because feature points existing in small regions in the image region must be extracted accurately. The input image resolution of the second CNN to extract feature points is not limited to 360×240 [pixels].

The second CNN to extract feature points has two layer levels (701 and 702), as shown in FIG. 7. The first layer level 701 extracts a total of four features: oblique (diagonal-right-up and diagonal-right-down) edges, a horizontal edge, and a vertical edge. To extract the feature points (left and right eyebrow feature points 140 to 145, left and right eye feature points 146 to 153, nostril feature point 154, and mouth feature points 155 to 158) necessary for facial expression recognition, the second layer level 702 prepares one feature of the CNN in correspondence with each feature point, as shown in FIG. 8. FIG. 8 is a schematic view showing the feature points to be extracted.

The second CNN to extract feature points can also acquire the feature points accurately by using weighting factors obtained by learning based on supervisory data, like the first CNN to detect a face. The second CNN to extract feature points uses learning data with only a specific variation, unlike the first CNN to detect a face position. Hence, the feature position detection accuracy of the second CNN to extract feature points is very high, although it lacks the high detection robustness of the first CNN to detect a face.

In this embodiment, learning is performed using images with only specific variations, i.e., eye and mouth shape variations and illumination variation. However, the present invention is not limited to this. For example, learning based on images with only illumination variation may be done using images acquired by changing the illumination variation width without lowering the feature point extraction accuracy, i.e., images under various illumination environments. Learning may also be executed using images with only other specific variations such as illumination variation and size variation. A feature for a single feature point may be prepared in correspondence with each of size variation, rotational variation, and illumination variation. The number of layers, the number of features, and the connection relationship between features of the second CNN to extract feature points may be changed, like the first CNN to detect a face. The CNN to extract feature points need not always extract one feature point from one feature. Feature points of similar features such as the right eye lateral canthus (V-shape) and left eye medial canthus (V-shape) may be extracted from the same feature of the CNN.

The predetermined feature amount extraction unit 110 restricts the processing region of each feature of each layer and executes the operation by using the second CNN for extracting feature points. More specifically, the predetermined feature amount extraction unit 110 decides a processing region restriction range to extract each feature point on the basis of the face position calculated by the first CNN (face position detection unit 101) for detecting a face position. The face position includes, e.g., the eye region barycentric positions 160 and 161, mouth region barycentric position 163, face region barycentric position 162, and nostril position (to be described later), as shown in FIG. 9.

(Region Restriction Processing)

Region restriction processing executed by the predetermined feature amount extraction unit 110 to extract the nostril barycentric position will be described next in detail with reference to FIG. 10. FIG. 10 is a schematic view showing a nostril barycentric position calculation range (barycenter calculation range) to obtain a nostril position, the product-sum operation range necessary for obtaining the barycenter calculation range, and the input image range necessary for obtaining the barycenter calculation range.

Referring to FIG. 10, a region 173 denotes a barycenter calculation range. As shown in FIG. 10, the barycenter calculation range 173 is a rectangular region having a horizontal range decided on the basis of a right eye detection position 170 and a left eye detection position 171. The vertical range of the barycenter calculation range 173 is decided on the basis of the right eye detection position 170 or left eye detection position 171 and a mouth detection position 172.

The barycenter calculation range 173 is used to calculate a barycentric position from the obtained neuron values. To calculate a barycenter in the barycenter calculation range 173, neuron values must exist in the barycenter calculation range 173. The minimum region of input image data necessary for ensuring the existence of neuron values in the barycenter calculation range 173 can be calculated by using the receptive field size to detect a nostril and the receptive field size of each feature of the first layer.

More specifically, to obtain neuron values in the nostril position barycenter calculation range 173, the feature neuron values of the first layer in a region 174 extended by ½ the receptive field size to detect a nostril are necessary. Hence, each feature of the first layer level requires the neuron values of the region 174. To obtain the neuron values of the region 174 in the first layer, the input image data of a region 175 extended by ½ the receptive field size to detect each feature of the first layer is necessary. In this way, the minimum input image data region necessary for the nostril position barycenter calculation range can be calculated. The nostril position can be calculated by executing the product-sum operation of the neuron values of the preceding layer and the weighting factors and then the threshold processing and barycentric position detection described above, in these restricted ranges.
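
The region chain 173 → 174 → 175 is simply a repeated half-receptive-field expansion. The sketch below illustrates it; the concrete coordinates and receptive field sizes are hypothetical values chosen for the example.

def extend_region(region, rf_h, rf_w):
    """Grow a (left, top, right, bottom) region by half the receptive
    field size on every side, as one CNN layer requires."""
    l, t, r, b = region
    return (l - rf_w // 2, t - rf_h // 2, r + rf_w // 2, b + rf_h // 2)

region_173 = (160, 100, 200, 140)               # hypothetical barycenter range
region_174 = extend_region(region_173, 11, 11)  # assume an 11x11 nostril receptive field
region_175 = extend_region(region_174, 5, 5)    # assume 5x5 first-layer receptive fields
print(region_174, region_175)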

Any one of a right nostril barycentric position 176, a left nostril barycentric position 177, the barycentric position of the left and right nostrils, and a subnasal edge 178 shown in FIG. 11 may be calculated as the nostril barycentric position. FIG. 11 is a schematic view showing the left and right nostril positions and the subnasal edge.

For learning, a region including the part to be set as the nostril position is set as the receptive field. Learning is done by setting the learning correct solution point to the barycentric position of the region including the part to be set as the nostril position. FIGS. 12A, 12B, and 12C are schematic views showing the receptive fields necessary for calculating the barycenter of the left and right nostril positions, the barycenter of the right nostril position, and the barycenter of the subnasal edge.

For example, to calculate the barycentric position of the left and right nostrils as the nostril position, a region including the left and right nostrils is set as the receptive field, as shown in FIG. 12A. Learning is executed by setting the learning correct solution point to the barycentric position of the left and right nostrils. To calculate the right nostril barycentric position 176 as the nostril position, a region including the right nostril is set as the receptive field, as shown in FIG. 12B. Learning is executed by setting the learning correct solution point to the right nostril barycentric position. To calculate the subnasal edge 178 as the nostril position, a region including the subnasal edge is set as the receptive field, as shown in FIG. 12C. Learning is executed by setting the learning correct solution point to the subnasal edge barycentric position. In this embodiment, the barycentric position of the left and right nostrils is calculated as the nostril position. The remaining feature points to be described below are expressed by relative positions to the nostril position.

(Feature Point Barycenter Calculation Range Setting Processing)

Processing of setting a feature point barycenter calculation range to extract the feature points other than the nostril feature point will be described next with reference to FIGS. 13 to 18 and 20. FIGS. 13 to 18 are views showing barycenter calculation ranges, more specifically, the barycenter calculation ranges to obtain the left and right eyebrow end feature points, the left and right eyebrow median feature points, the left and right eye end feature points, the feature points of the upper and lower edges of the left and right eyes, the mouth end feature points, and the feature points of the upper and lower edges of the mouth, respectively. FIG. 20 is a view showing the minimum input image region necessary for obtaining all feature points. In the following description, the distance between a right eye detection position 181 and a left eye detection position 182 will be defined as L. A horizontal position will be defined as an X-axis position, and a vertical position will be defined as a Y-axis position.

The barycenter calculation range to extract each feature point of the left and right eyebrows will be described. Referring to FIG. 13, a region 183 to extract the feature point 140 in FIG. 8 is defined to include an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181−L/2” to “x-coordinate of right eye detection position 181” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181−L/2” to “y-coordinate of right eye detection position 181”. A region 184 to extract the feature point 142 in FIG. 8 is defined to include an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181” to “x-coordinate of nostril position 180” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181−L/2” to “y-coordinate of right eye detection position 181”.

Referring to FIG. 14, a region 187 to extract the feature point 141 in FIG. 8 is defined to include an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181−L/4” to “x-coordinate of right eye detection position 181+L/4” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181−L/2” to “y-coordinate of right eye detection position 181”. Left eyebrow feature point extraction regions 185, 186, and 188 are set like the right eyebrow feature point extraction regions 183, 184, and 187.

The barycenter calculation range to extract each feature point of the left and right eyes will be described next. Referring to FIG. 15, a region 189 to extract the feature point 146 in FIG. 8 is defined to include an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181−L/2” to “x-coordinate of right eye detection position 181” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181−L/2” to “y-coordinate of right eye detection position 181+L/2”. A region 190 to extract the feature point 149 in FIG. 8 is defined to include an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181” to “x-coordinate of nostril position 180” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181−L/2” to “y-coordinate of right eye detection position 181+L/2”.

Referring to FIG. 16, a region 193 to extract the feature point 147 is defined to include an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181−L/8” to “x-coordinate of right eye detection position 181+L/8” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181−L/4” to “y-coordinate of right eye detection position 181”. A region 194 to extract the feature point 148 in FIG. 8 is defined to include an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181−L/8” to “x-coordinate of right eye detection position 181+L/8” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181” to “y-coordinate of right eye detection position 181+L/4”. Left eye feature point extraction regions 191, 192, 195, and 196 are set like the right eye feature point extraction regions 189, 190, 193, and 194.

The barycenter calculation range to extract each feature point of the mouth will be described next. The distance between the nostril position 180 and a mouth detection position 197 in FIG. 17 will be defined as L₁. A horizontal position will be defined as an X-axis position, and a vertical position will be defined as a Y-axis position, as in the above description.

Referring to FIG. 17, a region 198 to extract the feature point 155 in FIG. 8 is defined to have a horizontal length from “x-coordinate of mouth detection position 197−2L/3” to “x-coordinate of mouth detection position 197” and a vertical length from “y-coordinate of mouth detection position 197−L₁” to “y-coordinate of mouth detection position 197+L₁”. A region 199 to extract the feature point 158 in FIG. 8 is defined to have a horizontal length from “x-coordinate of mouth detection position 197” to “x-coordinate of mouth detection position 197+2L/3” and a vertical length from “y-coordinate of mouth detection position 197−L₁” to “y-coordinate of mouth detection position 197+L₁”.

Referring to FIG. 18, a region 200 to extract the feature point 156 in FIG. 8 is defined to have a horizontal length from “x-coordinate of mouth detection position 197−L/4” to “x-coordinate of mouth detection position 197+L/4” and a vertical length from “y-coordinate of nostril position 180” to “y-coordinate of mouth detection position 197”. A region 201 to extract the feature point 157 in FIG. 8 is defined to have a horizontal length from “x-coordinate of mouth detection position 197−L/4” to “x-coordinate of mouth detection position 197+L/4” and a vertical length from “y-coordinate of mouth detection position 197” to “y-coordinate of mouth detection position 197+L₁”.

As described above, the predetermined feature amount extraction unit 110 decides each barycenter calculation range to extract a feature point on the basis of the image sensing target's face position detected by the face position detection unit 101. The minimum necessary input image data region, like the hatched region 210 in FIG. 20, is calculated by using the receptive field size to obtain each feature point and the receptive field size of each feature of the first layer in the above-described way. Since the regions are restricted, the processing load on the CNN in feature point extraction can be reduced.
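
As an illustration of this region setting, the following computes the right-eyebrow ranges (regions 183 and 184 of FIG. 13) from the detected positions; the coordinate values are hypothetical, and L is taken as the distance between the right and left eye detection positions as defined above.

import math

def right_eyebrow_ranges(right_eye, left_eye, nostril):
    """Regions 183 and 184 as (left, top, right, bottom), per the formulas above."""
    L = math.dist(right_eye, left_eye)   # distance between the eye detection positions
    ex, ey = right_eye
    region_183 = (ex - L / 2, ey - L / 2, ex, ey)
    region_184 = (ex, ey - L / 2, nostril[0], ey)
    return region_183, region_184

print(right_eyebrow_ranges((120, 100), (200, 100), (160, 150)))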

The above-described arrangement sets the regions to extract feature points on the basis of the face detection position, left and right eye detection positions, mouth detection position, and nostril position obtained by the first CNN for face detection in the current frame. However, the present invention is not limited to this. For example, the feature points may be extracted on the basis of those extracted in the preceding frame (e.g., the nostril position and feature points extracted in the preceding frame). Alternatively, the regions may be set on the basis of a plurality of positions between feature points. The present invention is not limited to the above-described region setting range.

In the above description, feature point coordinates are expressed as relative positions to the nostril detection position (feature point 154 in FIG. 8). However, the present invention is not limited to this. For example, feature point coordinates may be expressed as relative positions to the face detection position or a medial canthus feature point (feature point 149 or 150 in FIG. 8).

(Feature Amounts)

The feature amounts necessary for recognizing facial expression from the obtained feature points will be described next with reference to FIGS. 8 and 19. FIG. 19 is a view showing the forehead, glabella, and cheek regions.

In this embodiment, the following feature amounts are extracted and used for facial expression recognition. The feature amounts listed below are merely examples, and any other values can be used as feature amounts in accordance with the use and purpose.

The shapes of the eyebrows (e.g., the angle (tilt) made by the line segment connecting the feature points 140 and 141 and the line segment connecting the feature points 141 and 142, and/or the angle (tilt) made by the line segment connecting the feature points 143 and 144 and the line segment connecting the feature points 144 and 145 in FIG. 8).

The distance between the left and right eyebrows (the distance between the feature points 142 and 143 in FIG. 8).

The distances between the eyebrows and eyes (the distance between the feature points 140 and 146, the distance between the feature points 141 and 147, the distance between the feature points 142 and 149, the distance between the feature points 143 and 150, the distance between the feature points 144 and 151, and the distance between the feature points 145 and 153 in FIG. 8).

The distances between the eye ends and mouth ends (the distance between the feature points 146 and 155 and the distance between the feature points 153 and 158 in FIG. 8).

The distances between the eye ends (the distance between the feature points 146 and 149 and the distance between the feature points 150 and 153 in FIG. 8).

The distances between the upper and lower edges of the eye regions (the distance between the feature points 147 and 148 and the distance between the feature points 151 and 152 in FIG. 8).

The distance between the mouth ends (the distance between the feature points 155 and 158 in FIG. 8).

The distance between the upper and lower edges of the mouth region (the distance between the feature points 156 and 157 in FIG. 8).

Wrinkles in the forehead and glabella regions (the edge densities of regions 220 and 221 in FIG. 19).

Wrinkles in the left and right cheek regions (the edge densities of regions 222 and 223 in FIG. 19).

The forehead and glabella region 220 in FIG. 19 is, e.g., a rectangular region including an X-axis region with a horizontal length from “x-coordinate of right eye detection position 181” to “x-coordinate of nostril position 180” and a Y-axis region with a vertical length from “y-coordinate of right eye detection position 181−2L/3” to “y-coordinate of right eye detection position 181”. The distance between the right eye detection position 181 and the left eye detection position 182 is L. The cheek region 222 is, e.g., a rectangular region including an X-axis region with a horizontal length from “x-coordinate of nostril position 180−L” to “x-coordinate of nostril position 180” and a Y-axis region with a vertical length from “y-coordinate of nostril position 180−L/4” to “y-coordinate of mouth detection position 197”.

An edge density can be calculated by, e.g., counting the number of pixels contained in edges in the region on the basis of the result of edge feature extraction by the first layer of the CNN and dividing the number of pixels by the area of the region.
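
A minimal sketch of that edge-density calculation, assuming the first-layer edge feature is available as a 2-D array in which nonzero entries mark edge pixels:

import numpy as np

def edge_density(edge_map: np.ndarray, region) -> float:
    """Count edge pixels inside a (left, top, right, bottom) region and
    divide by the region's area."""
    l, t, r, b = region
    patch = edge_map[t:b, l:r]
    return float(np.count_nonzero(patch)) / max(patch.size, 1)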

[Feature Amount Variation Calculation Unit 111]

The feature amount variation calculation unit 111 will be described next. The feature amount variation calculation unit 111 calculates the variation of each feature amount by calculating the ratio of each feature amount between an expressionless face image prepared in advance and the face image of the current frame. The feature amount variation calculation unit 111 also normalizes the feature amount variations in accordance with size and rotational variations of the face in the image. As described above, the normalization corrects the positions of feature points on the basis of their layout in the image data.

Variations are detected on the basis of a distance a1 between the detection position of a right medial canthus feature point 230 and a medial canthus median point 233, a distance b1 between the detection position of a left medial canthus feature point 231 and the medial canthus median point 233, and a distance c1 between the detection position of a nostril position 232 and the medial canthus median point 233, as shown in FIG. 21. The distance between the right medial canthus feature point and the medial canthus median point 233, the distance between the left medial canthus feature point and the medial canthus median point 233, and the distance between the nostril position and the medial canthus median point 233 in the expressionless face image set (prepared) in advance are represented by a, b, and c, respectively.

Size variation of the face is determined by calculating the ratios of the distances a1 (240 in FIG. 22), b1 (241 in FIG. 22), and c1 (242 in FIG. 22) between the detection positions obtained from the current frame shown in FIG. 22 to the distances a, b, and c between the detection positions obtained from the preset expressionless face image. FIG. 22 is a view showing the barycentric positions of the left and right eye regions and face region when size variation has occurred. For example, when a:b:c=a1:b1:c1 and a:a1=1:2, the face size variation is twice. In this case, normalization is done by multiplying each calculated feature amount variation by ½.

Horizontal rotational variation of the face can be calculated by, e.g., comparing a2:b2 (250 and 251 in FIG. 23) in the current frame image shown in FIG. 23 with a:b in the expressionless frontal face image prepared in advance. FIG. 23 is a view showing the barycentric positions of the left and right eye regions and face region when horizontal rotational variation has occurred.

For example, consider recognition of a face turned round to the left side as shown in FIG. 23. Assume that a:b=5:5 in the expressionless frontal face image prepared in advance, and a2:b2=5:3 (250 and 251 in FIG. 23) in the current frame image. In this case, normalization can be done by multiplying the horizontal feature amounts influenced by the leftward rotation, i.e., the distance between the left eyebrow ends, the distance between the left eye ends, and the distance between the mouth ends, by (a2/b2)/(a/b). The distance between the left eyebrow ends is, e.g., the distance between the feature points 143 and 145 in FIG. 8. The distance between the left eye ends is, e.g., the distance between the feature points 150 and 153 in FIG. 8. The distance between the mouth ends is, e.g., the distance between the feature points 155 and 158 in FIG. 8.

The eyebrow shape can be normalized by multiplying a horizontal region calculated from the feature points 143 and 144 and a horizontal region calculated from the feature points 144 and 145 by (a2/b2)/(a/b).

Vertical rotational variation of the face can be determined on the basis of the ratio of a distance c3 (262 in FIG. 24) in the face image of the current frame to the distance c in the expressionless frontal face image prepared in advance. FIG. 24 is a schematic view showing the barycentric positions of the left and right eye regions and face region when vertical rotational variation has occurred. For example, when a/a3=b/b3=1 and c:c3=2:1, the face has varied only in the vertical direction. In this case, normalization can be executed by using, as variations, values obtained by multiplying vertical feature amounts, i.e., the distances between eye ends and mouth ends, the distances between eyebrows and eyes, the distances between the upper and lower edges of eye regions, and the distance between the upper and lower edges of a mouth region, by c3/c.

The distances between eye ends and mouth ends include, e.g., the distance between the feature points 146 and 155 and the distance between the feature points 153 and 158 in FIG. 8. The distances between eyebrows and eyes include, e.g., the distance between the feature points 140 and 146, the distance between the feature points 141 and 147, the distance between the feature points 142 and 149, the distance between the feature points 143 and 150, the distance between the feature points 144 and 151, and the distance between the feature points 145 and 153 in FIG. 8. The distances between the upper and lower edges of eye regions include, e.g., the distance between the feature points 147 and 148 and the distance between the feature points 151 and 152 in FIG. 8. The distance between the upper and lower edges of a mouth region includes, e.g., the distance between the feature points 156 and 157 in FIG. 8.
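The corresponding vertical correction is sketched below with the example values from the text; the feature amount is again a hypothetical placeholder.

```python
def vertical_rotation_factor(c: float, c3: float) -> float:
    """Correction factor c3/c for vertical rotation.

    c: nostril-to-medial-canthus-median-point distance in the expressionless
    frontal image; c3: the same distance in the current frame.
    """
    return c3 / c

# Example from the text: a/a3 = b/b3 = 1 and c:c3 = 2:1, i.e. only vertical
# rotation occurred, so vertical feature amounts are multiplied by 1/2.
factor = vertical_rotation_factor(c=14.0, c3=7.0)
eyebrow_to_eye = 1.3               # hypothetical vertical feature amount
print(eyebrow_to_eye * factor)
```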

With the above-described arrangement, variations can be detected by using the right medial canthus feature point, left medial canthus feature point, and nostril position. Even when both rotation and size variation have occurred, feature amounts can be normalized by the same processing (using the right medial canthus feature point, left medial canthus feature point, and nostril position) as described above. The above-described normalization processing is merely an example, and the present invention is not limited to this. For example, variations may be detected by using face parts such as the right eye detection position, left eye detection position, and face detection position or other feature points, and feature amount variations may be normalized.

[Facial Expression Determination Unit 112]

The facial expression determination unit 112 will be described next with reference to FIG. 25. FIG. 25 is a schematic view showing the arrangement of a CNN to determine facial expression.

The facial expression determination unit 112 executes determination by using a three-layer neural network including an input layer 2501 that receives feature amount variations normalized by the feature amount variation calculation unit 111, an intermediate layer 2502, and an output layer 2503 that outputs a facial expression determination result, as shown in FIG. 25. In the arrangement of this embodiment, one neuron is assigned to each feature amount variation at the input layer and to each facial expression determination result at the output layer.

The input layer 2501 receives normalized feature amount variations. In this embodiment, the input layer 2501 receives, e.g., the following 22 features:

“Shapes of eyebrows” feature amount variations (4)

“Distance between left and right eyebrows” feature amount variation (1)

“Distances between eyebrows and eyes” feature amount variations (6)

“Distances between eye ends and mouth ends” feature amount variations (2)

“Distances between eye ends” feature amount variations (2)

“Distances between the upper and lower edges of eye regions” feature amount variations (2)

“Distance between mouth ends” feature amount variation (1)

“Distance between the upper and lower edges of mouth region” feature amount variation (1)

“Wrinkles in forehead and glabella regions (edge densities)” feature amount variation (1)

“Wrinkles in left and right cheek regions (edge densities)” feature amount variations (2)

The intermediate layer (hidden layer) 2502 executes intermediate processing necessary for facial expression determination. In this embodiment, the intermediate layer 2502 includes 10 neurons (features).

The output layer 2503 determines facial expression on the basis of inputs from the intermediate layer 2502. In this embodiment, the output layer 2503 includes eight features (neurons) to output facial expressions such as “joy”, “anger”, “sadness”, “pity”, “expressionless”, “worry”, and “surprise”.
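A minimal sketch of such a 22-10-8 network is shown below. The text does not specify the activation function, the trained weight values, or the label of the eighth output neuron (only seven expressions are named), so the sigmoid activation, the random untrained weights, and the placeholder eighth label are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 22 normalized feature amount variations -> 10 hidden neurons -> 8 outputs.
# The weights are random placeholders; a trained network would have learned them.
W1 = rng.normal(size=(10, 22)); b1 = np.zeros(10)
W2 = rng.normal(size=(8, 10));  b2 = np.zeros(8)

# The text names seven expressions for eight neurons; the eighth label here
# is a placeholder, not taken from the source.
EXPRESSIONS = ["joy", "anger", "sadness", "pity",
               "expressionless", "worry", "surprise", "other"]

def determine_expression(variations: np.ndarray) -> str:
    """Forward pass; the strongest output neuron names the expression."""
    hidden = sigmoid(W1 @ variations + b1)
    output = sigmoid(W2 @ hidden + b2)
    return EXPRESSIONS[int(np.argmax(output))]

print(determine_expression(np.ones(22)))  # all variations "1" = expressionless
```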

When the recognition target face forms certain facial expression, specific feature amount variations increase/decrease. In, e.g., facial expression “joy”, feature amount variations increase/decrease in the following manner as compared to an expressionless state. The variations of the distances between the eye ends and the mouth ends (between 146 and 155 and between 153 and 158 in FIG. 8) decrease. The variation of the distance between the mouth ends (between 155 and 158 in FIG. 8), the variations of the edge densities of the cheek regions (the edge densities of the regions 222 and 223 in FIG. 19), and the variations of the distances between the lateral and medial canthi (between 146 and 149 and between 150 and 153 in FIG. 8) increase.

The facial expression of the recognition target face can be determined on the basis of the types of the feature amount variations which increase or decrease and their increase/decrease amounts. In this embodiment, a threshold value is set for each feature amount variation in correspondence with each facial expression. The NN is made to learn facial expression on the basis of comparison between the threshold values and detected feature amount variations. Learning is done such that a neuron corresponding to facial expression determined on the basis of the magnitude relationship between the feature amount variations and the threshold values outputs “1”. The output value range of the output layer 2503 is 0 to 1.

For example, the threshold values of the feature amount variations are set in the following way in correspondence with facial expression “joy”. The feature amount variations in the expressionless state are “1”.

The variations of the distances between the eye ends and the mouth ends (between 146 and 155 and between 153 and 158 in FIG. 8): 0.7

The variation (feature amount variation 2) of the distance between the mouth ends (between 155 and 158 in FIG. 8): 1.2

The variations (feature amount variation 4) of the edge densities of the cheek regions (the edge densities of the regions 222 and 223 in FIG. 19): 1.2

The variations (feature amount variation 5) of the distances between the lateral and medial canthi (between 146 and 149 and between 150 and 153 in FIG. 8): 1.1

Remaining feature amount variations: 1.0

The NN learns “joy” when the variations of the distances between the eye ends and the mouth ends are equal to or smaller than the threshold value (0.7), and the variation of the distance between the mouth ends, the variations of the edge densities of the cheek regions, and the variations of the distances between the lateral and medial canthi are equal to or larger than the threshold values (1.2, 1.2, and 1.1). That is, the NN learns to make the neuron corresponding to “joy” output a value of “1” or almost “1”. The threshold values are stored in the table 113. FIG. 40 is a view showing the contents of the table 113. The facial expression determination unit 112 controls learning of neurons by looking up the table 113. The table 113 is defined in a storage device such as the HD 395 in advance.
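The following sketch illustrates, under assumptions, how a supervisory label for “joy” could be derived from such a threshold table. The feature names and the dictionary layout are hypothetical stand-ins for table 113, whose actual contents are given in FIG. 40.

```python
# Threshold rules for "joy"; variations are ratios relative to the
# expressionless state ("1.0"). Names and layout are hypothetical.
JOY_THRESHOLDS = {
    "eye_end_to_mouth_end": ("<=", 0.7),
    "mouth_width":          (">=", 1.2),
    "cheek_edge_density":   (">=", 1.2),
    "canthus_distance":     (">=", 1.1),
}

def joy_supervisory_label(variations: dict) -> float:
    """Return 1.0 (teach the "joy" neuron to fire) iff every threshold holds."""
    for name, (op, threshold) in JOY_THRESHOLDS.items():
        value = variations[name]
        ok = value <= threshold if op == "<=" else value >= threshold
        if not ok:
            return 0.0
    return 1.0

print(joy_supervisory_label({
    "eye_end_to_mouth_end": 0.65,
    "mouth_width": 1.3,
    "cheek_edge_density": 1.25,
    "canthus_distance": 1.15,
}))  # -> 1.0
```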

Learning is done by giving supervisory data to the output layer 2503 of the NN in correspondence with the input to the input layer. Hence, once learning is complete, the facial expression determination unit 112 can determine facial expression by supplying feature amount variations to the input layer and reading the determination result from the output layer.

The arrangement of the input layer 2501, intermediate layer 2502, and output layer 2503 is not limited to the above-described arrangement. For example, a threshold value may be set in advance for the inputs to the input layer 2501 and the outputs from the output layer 2503. A value equal to or larger than the threshold value is defined as “1”, and a value smaller than the threshold value is defined as “0”, so that only values of “0” and “1” are input or output. The facial expression to be determined is not limited to “joy”. For example, “anger”, “sadness”, “pity”, “expressionless”, “worry”, and “surprise” may be determined.

The output layer of the NN for facial expression determination may output a plurality of features with a strong value (i.e., a value close to the upper limit value). In this case, facial expression is determined on the basis of the neuron groups that output a strong value. For example, when facial expressions “joy” and “sadness” are obtained, i.e., both the neuron corresponding to “joy” and the neuron corresponding to “sadness” output strong values, the facial expression is determined to be nearly “cry for joy”. When a plurality of neurons included in the output layer 2503 output strong values, facial expression determination can be done in, e.g., the following way. A table storing the correspondence between neuron groups outputting strong values and facial expressions is prepared in a storage device such as the HD 395. Facial expression can be determined by looking up this table.

In the above-described arrangement, the determination may be done after, e.g., multiplying the feature amounts by a preset weighting value. The arrangement for facial expression determination is not limited to that based on the above-described method. Facial expression determination processing based on a different method will be described with reference to FIGS. 26 to 28. FIG. 26 is a table showing the weights (weighting values) of feature amount variations in calculating scores from the feature amount variations to determine facial expression “joy”. FIG. 27 is a graph showing the distribution of scores calculated from the feature amount variations. FIG. 28 is a graph showing a score distribution template prepared in advance for facial expression “joy”.

First, as shown in FIG. 26, the feature amount variations are weighted in accordance with each facial expression. A score is then calculated for each feature amount as the product of its weighting value and its feature amount variation. A facial expression score distribution is created on the basis of the calculated scores. The created facial expression score distribution is compared with a score distribution template preset for each facial expression. The facial expression corresponding to the template having the most similar score distribution is determined as the facial expression indicated by the face as the recognition target object.

For example, a calculated score distribution to determine facial expression “joy” is assumed to be the score distribution shown in FIG. 27. A preset score distribution template similar to the score distribution in FIG. 27 is assumed to be that corresponding to facial expression “joy” in FIG. 28. In this case, facial expression is determined as “joy”.
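A hedged sketch of this score-distribution method follows. The weights, templates, and the use of Euclidean distance as the similarity measure are assumptions for illustration; FIGS. 26 and 28 define the actual weights and templates.

```python
import numpy as np

# Per-expression weights and score-distribution templates over four
# hypothetical feature amounts (placeholders for the FIG. 26/28 contents).
WEIGHTS = {
    "joy":     np.array([2.0, 1.5, 1.5, 1.0]),
    "sadness": np.array([0.5, 0.5, 1.0, 2.0]),
}
TEMPLATES = {
    "joy":     np.array([1.4, 1.8, 1.8, 1.1]),
    "sadness": np.array([0.4, 0.5, 0.9, 2.2]),
}

def determine_by_scores(variations: np.ndarray) -> str:
    """Pick the expression whose template is closest to the score distribution."""
    best, best_dist = None, float("inf")
    for expression, weights in WEIGHTS.items():
        scores = weights * variations            # per-feature scores
        dist = np.linalg.norm(scores - TEMPLATES[expression])
        if dist < best_dist:
            best, best_dist = expression, dist
    return best

print(determine_by_scores(np.array([0.7, 1.2, 1.2, 1.1])))
```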

As described above, in the image sensing device according to this embodiment, the position (face position) of a specific part of a face in image data is detected on the basis of the face outline. Regions to search for feature points are set on the basis of the detected face position. The feature points are searched for not in the entire region of the image data but only in the set regions. Hence, the search operation can be done efficiently.

In the image sensing device according to this embodiment, the face position is detected by using low-resolution image data, whereas feature point search is executed by using high-resolution image data. Feature points can therefore be searched for efficiently and extracted accurately, which makes it possible to determine facial expression accurately.

In the image sensing device according to this embodiment, the use of two networks (neural networks) makes it possible to accurately extract feature points even when various kinds of variations have occurred. In addition, even a change in facial expression with very small changes in face features can be recognized by accurately extracting the feature points.

Second Embodiment

In the first embodiment, the feature amount of an expressionless reference face registered in advance is compared with the feature amount of a recognition target face. Facial expression is determined on the basis of calculated feature amount variations. However, the facial expression determination method is not limited to this. In the second embodiment, an arrangement will be described in which each frame of a measured image is analyzed, and a change in facial expression is determined on the basis of acquired motion vectors. An information processing apparatus of this embodiment has the same hardware configuration as in the first embodiment.

[Functional Arrangement of Information Processing Apparatus]

The functional arrangement for object recognition according to this embodiment will be described first with reference to FIG. 30. FIG. 30 is a block diagram showing the functional arrangement of the information processing apparatus according to this embodiment.

As shown in FIG. 30, the functional arrangement of the information processing apparatus of this embodiment includes an image input unit 300, face position detection unit 301, and facial expression recognition unit 302. Processing in the image input unit 300 and face position detection unit 301 is the same as in the first embodiment, and a description thereof will be omitted.

FIG. 31 is a block diagram showing the functional arrangement of the facial expression recognition unit 302. In this embodiment, the facial expression recognition unit 302 comprises a predetermined feature point extraction unit 310, motion vector calculation unit 311, and facial expression determination unit 312, as shown in FIG. 31. The facial expression determination unit 312 causes neurons to learn facial expression change determination by looking up a table 313 that stores correspondence between motion vectors and facial expression changes. Processing in the predetermined feature point extraction unit 310 is the same as in the first embodiment, and a description thereof will be omitted. In this embodiment, feature point coordinates are expressed on the basis of a face detection position. However, the present invention is not limited to this. The motion vector calculation unit 311 calculates, on the basis of the face position detected by the face position detection unit 301, motion vectors each having an initial point at the face position and an end point at a feature point. The facial expression determination unit 312 determines facial expression by using an NN, as in the first embodiment.

[Overall Processing]

Overall processing executed by the arrangement of this embodiment will be described next with reference to FIG. 36. FIG. 36 is a flowchart showing the procedure of overall processing according to this embodiment.

In step S320, the face position detection unit 301 executes decimation and histogram correction of image data acquired by the image input unit 300. The image resolution after decimation is, e.g., 360×240 [pixels].

In step S321, the face position detection unit 301 determines a face position in the image by using the CNN. The resolution of the input image to the CNN to determine a face position is further reduced to, e.g., 180×120 [pixels] by decimation.
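As an illustration of the decimation in steps S320 and S321, the sketch below reduces resolution by simple stride subsampling; the device may instead apply a low-pass filter before subsampling, which the text leaves unspecified.

```python
import numpy as np

def decimate(image: np.ndarray, factor: int) -> np.ndarray:
    """Reduce resolution by keeping every `factor`-th pixel in each axis.

    Simple stride decimation; an averaging or other low-pass step before
    subsampling is equally plausible and not specified by the text.
    """
    return image[::factor, ::factor]

frame = np.zeros((480, 720), dtype=np.uint8)   # hypothetical camera frame
print(decimate(frame, 2).shape)   # (240, 360) -> the 360x240 working image
print(decimate(frame, 4).shape)   # (120, 180) -> the 180x120 CNN input
```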

In step S322, the facial expression recognition unit 302 determines whether a face is detected. If a face is detected (YES in step S322), the process advances to step S323. If no face is detected (NO in step S322), the process returns to step S320 to execute the same processing for the image data of the next frame.

In step S323, the predetermined feature point extraction unit 310 sets a nostril feature point extraction range by using face and eye positions extracted by the first CNN for face position detection.

In step S324, the predetermined feature point extraction unit 310 extracts feature points by using the second CNN on the basis of the extraction range set in step S323. The resolution of the input image to the second CNN for feature point extraction is, e.g., 360×240 [pixels].

In step S325, the predetermined feature point extraction unit 310 determines whether all feature points are extracted by the processing in steps S323 and S324. If all feature points are extracted (YES in step S325), the process advances to step S326. If not all feature points are extracted (NO in step S325), the process returns to step S320 to execute the same processing for the next frame.

In step S326, the motion vector calculation unit 311 calculates motion vectors of the feature points by comparing vectors calculated in the preceding frame with those calculated in the current frame.

In step S327, facial expression is determined by using an NN for facial expression determination on the basis of the motion vectors calculated in step S326. The processing is then complete.

Processing in each step will be described below in detail by explaining processing in each functional arrangement.

[Motion Vector Calculation Unit 311]

The function of the motion vector calculation unit 311 will be described next in detail. The motion vector calculation unit 311 calculates, on the basis of the face position detected by the face position detection unit 301, motion vectors each having an initial point at the face position and an end point at a feature point. The number of motion vectors equals the number of feature points shown in FIG. 8, excluding the nostril feature point.

Motion vector calculation will be described with reference to FIG. 32. FIG. 32 is a schematic view showing a vector that has the initial point at the face detection position and the end point at the right lateral canthus feature point in t [frame] and t+1 [frame] images.

Referring to FIG. 32, reference numeral 3201 denotes a face detection position as a reference point; 3202, a lateral canthus feature point in t [frame]; and 3203, a lateral canthus feature point in t+1 [frame]. As shown in FIG. 32, in t [frame] and t+1 [frame], vectors c and b are defined by setting the face detection position 3201 as an initial point and the lateral canthus feature points 3202 and 3203 as end points. A motion vector a is defined as a=b−c.

FIG. 33 is a schematic view showing calculation of a motion vector. Motion vectors are calculated similarly for the remaining feature points. A total of 18 motion vectors, one for each feature point except the nostril feature point, are calculated. Instead of using t [frame] and t+1 [frame], t [frame] and t+2 [frame] or t+3 [frame] may be used in accordance with the frame rate to calculate motion vectors.
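A minimal sketch of the calculation a = b − c; the coordinates below are hypothetical pixel positions.

```python
import numpy as np

def motion_vector(face_pos, feature_t, feature_t1):
    """Motion vector a = b - c between consecutive frames.

    c: vector from the face detection position to the feature point at t;
    b: the same vector at t+1. Coordinates are (x, y) pixel positions.
    """
    c = np.asarray(feature_t) - np.asarray(face_pos)
    b = np.asarray(feature_t1) - np.asarray(face_pos)
    return b - c

# Example: the right lateral canthus moves slightly between frames.
a = motion_vector(face_pos=(100, 120), feature_t=(70, 95), feature_t1=(67, 92))
print(a)  # [-3 -3]
```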

The directions and sizes of the calculated motion vectors are changed by variations. Normalization is executed to cope with a size change. For example, the size of each vector is represented on the basis of an intercanthal distance |f|.

For example, referring to FIG. 34, when a vector f is defined as the reference of normalization, a vector d can be expressed by d/|f| after normalization. If the size varies, and the intercanthal distance changes to |g|, as shown in FIG. 35, a vector e in FIG. 35 can be expressed by e/|g| after normalization. With this normalization, if only the size varies without changes in face features such as the eyes and mouth, the vector d equals the vector e. This makes it possible to suppress recognition errors caused by the image sensing angle.

If horizontal rotational variation has occurred, only the horizontal component of the vector in FIG. 34 changes. The magnitude of a horizontal component d2 of the vector d in FIG. 34 is normalized in accordance with rotational variation. In the normalization, rotation is detected by using the face detection position and left and right eye detection positions, and feature point layout is corrected on the basis of the detected rotation, as described in the first embodiment.

For example, in FIG. 23, the horizontal component of each vector obtained from feature points in the rotational direction region is multiplied by a2/b2. As shown in FIG. 33, the motion vector a is calculated from b−c=a. The feature points in the rotational direction region are, e.g., feature points 143, 144, 145, 150, 151, 152, 153, and 158 in FIG. 8.

Similarly, in vertical rotational variation, the magnitude of a vertical component d1 of each of the vectors obtained from all feature points except the nostril feature point is multiplied by c/c3. After that, the motion vector a is calculated from b−c=a, as shown in FIG. 33.
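Combining the size and rotational corrections described above, a normalization routine might look like the following sketch; the parameter values and the region-membership flag are assumptions of the example.

```python
import numpy as np

def normalize_vector(v, intercanthal, a2=1.0, b2=1.0, c=1.0, c3=1.0,
                     in_rotation_region=False):
    """Normalize a motion vector for size and rotational variation.

    Size: divide by the intercanthal distance |f| (or |g| after the change).
    Horizontal rotation: scale the horizontal component d2 by a2/b2 for
    feature points in the rotational direction region. Vertical rotation:
    scale the vertical component d1 by c/c3. Values here are hypothetical.
    """
    v = np.asarray(v, dtype=float) / intercanthal
    if in_rotation_region:
        v[0] *= a2 / b2       # horizontal component d2
    v[1] *= c / c3            # vertical component d1
    return v

print(normalize_vector((4, -3), intercanthal=50.0,
                       a2=5, b2=3, c=2, c3=1, in_rotation_region=True))
```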

The initial point of a vector calculated from feature points is not limited to the above-described face detection position. Alternatively, a nostril feature point position (feature point 154 in FIG. 8), medial canthus feature points (feature points 149 and 150 in FIG. 8), eye detection positions (right eye detection position 160 and left eye detection position 161 in FIG. 9) obtained by the face detection CNN, and a mouth detection position (163 in FIG. 9) may be used.

[Facial Expression Determination Unit 312]

The facial expression determination unit 312 will be described next. The facial expression determination unit 312 determines facial expression by using NNs as in the first embodiment. In the first embodiment, 22 normalized feature amount variations obtained by comparison with an expressionless face prepared in advance in a storage device such as the HD 395 are input. In the second embodiment, for example, the horizontal and vertical components of the 18 motion vectors, i.e., a total of 36 values representing the sizes and directions of the vectors, are input to an NN. For example, a motion vector (4,−3) can be decomposed into a horizontal component +4 and a vertical component −3. The sizes and directions of these components are input.
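A sketch of assembling this 36-dimensional NN input from 18 motion vectors; the vector values below are random placeholders for normalized per-feature-point motion.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(18, 2))   # (horizontal, vertical) per feature point

# 18 vectors decomposed into components give 36 signed input values,
# e.g. (4, -3) contributes +4 and -3.
nn_input = vectors.reshape(-1)
print(nn_input.shape)                # (36,)
```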

On the other hand, the output includes eight facial expression determination neurons that output a value from “0” to “1”. The neurons of the output system are the same as those of the first embodiment. Learning of facial expression will be described. As described in the first embodiment, when the face serving as the recognition target object exhibits certain facial expression, specific feature amount variations increase/decrease. Likewise, when the face serving as the recognition target object exhibits certain facial expression, motion vectors have specific directions and sizes. For this reason, when specific directions and sizes of motion vectors representing certain facial expression are input to the features of the input layer, the neuron in the output layer that represents this facial expression is made to output a value close to “1”. Learning is thus performed.

The table 313 stores the correspondence between the parameters (e.g., values representing directions and sizes) of motion vectors and facial expressions. FIG. 41 is a view showing the contents of the table 313. The facial expression determination unit 312 controls learning of the neurons by looking up the table 313. For example, learning is controlled to increase the output level of “joy” if parameter 1 of motion vector 1 defined in advance tends to increase while parameter 2 tends to decrease. The table 313 is defined in a storage device such as an HD 395 in advance.

As described above, in the arrangement according to this embodiment, facial expression is determined on the basis of motion vectors calculated from feature points in adjacent frames. Hence, a change in facial expression can efficiently be detected.

Third Embodiment

In the first and second embodiments, the information processing apparatus is assumed to be a PC, WS, or PDA. However, the present invention is not limited to this. For example, the above-described arrangement may be implemented by an image sensing device such as a digital camera.

The arrangement of this embodiment incorporates face detection and facial expression recognition functions in an image sensing device such as a digital camera (camera) to make it possible to automatically detect that an object exhibits preset desired facial expression (e.g., “joy”) and automatically record it. In addition, the recorded image is displayed.

FIG. 37 is a block diagram showing the functional arrangement of the information processing apparatus according to the third embodiment. The information processing apparatus of this embodiment comprises an image input unit 400, face position detection unit 401, facial expression recognition unit 402, image display unit 403, and image storage unit 404, as shown in FIG. 37.

The image input unit 400, face position detection unit 401, and facial expression recognition unit 402 execute the same processing as in the first and second embodiments.

The image display unit 403 displays, on a display 397, an image determined by the facial expression recognition unit 402 to have preset facial expression. That is, image data temporarily stored in a buffer memory such as a RAM 392 is displayed on the display 397. At this time, the image data may be interlaced every several pixels and displayed. In this case, high-speed display is possible.

The image storage unit 404 stores the image data displayed on the display 397 in a storage device such as a RAM or memory (e.g., flash memory) 394.

[Overall Processing]

Overall processing executed by the arrangement of this embodiment will be described next with reference to FIG. 38. FIG. 38 is a flowchart showing the procedure of overall processing according to this embodiment.

In step S410, the face position detection unit 401 executes decimation and histogram correction of image data acquired by the image input unit 400. The image resolution after decimation is, e.g., 360×240 [pixels].

In step S411, the face position detection unit 401 determines a face position in the image by using the CNN. The resolution of the input image to the CNN to determine a face position is further reduced to, e.g., 180×120 [pixels] by decimation.

In step S412, the facial expression recognition unit 402 determines whether a face is detected. If a face is detected (YES in step S412), the process advances to step S413. If no face is detected (NO in step S412), the process returns to step S410 to execute the same processing for the image data of the next frame.

In step S413, the facial expression recognition unit 402 sets a nostril feature point extraction range by using face and eye positions extracted by the first CNN for face position detection.

In step S414, the facial expression recognition unit 402 extracts feature points by using the second CNN on the basis of the extraction range set in step S413. The resolution of the input image to the second CNN for feature point extraction is, e.g., 360×240 [pixels].

In step S415, the facial expression recognition unit 402 determines whether all feature points are extracted by the processing in steps S413 and S414. If all feature points are extracted (YES in step S415), the process advances to step S416. If not all feature points are extracted (NO in step S415), the process returns to step S410 to execute the same processing for the next frame.

In step S416, the facial expression recognition unit 402 calculates motion vectors of the feature points by comparing vectors calculated in the preceding frame with those calculated in the current frame.

In step S417, facial expression is determined by using an NN for facial expression determination on the basis of the motion vectors calculated in step S416.

In step S418, it is determined whether facial expression is recognized in step S417. If facial expression is recognized (YES in step S418), the process advances to step S419. If facial expression is not recognized (NO in step S418), the process returns to step S410 to continue the processing.

In step S419, image data with recognized facial expression is displayed on the display 397. This display is done at a lower resolution as needed. In addition to the image data, a user interface to allow the user to select whether to store the displayed image data in a storage device such as the medium 394 is displayed on the display 397.

If the user selects storage of image data in step S420 (YES in step S420), the process advances to step S421. If storage is not selected (NO in step S420), the process returns to step S410 to continue the processing.

In step S421, the image data is stored in the medium 394 (e.g., flash memory) at a high resolution. The processing then ends.

The processing in steps S418 to S421 may also be executed in, e.g., the following manner. Facial expression to be displayed on the display 397 and/or stored in a storage device such as the medium 394 is set in advance. In step S418, it is determined whether the recognition target image is recognized to have the preset facial expression. If the facial expression is recognized (YES in step S418), the process advances to step S419. If the facial expression is not recognized (NO in step S418), the process returns to step S410.

In step S419, the image data is displayed on the display 397. The process advances to step S421 while skipping step S420.

In step S421, the image data is stored in a storage device such as the medium 394.

As described above, the image sensing device according to this embodiment automatically recognizes facial expression of image data and displays and stores only image data corresponding to preset facial expression. Hence, the user can take a desired image without missing the shutter chance.

Other Embodiment

The embodiments of the present invention have been described above in detail. The present invention can take a form such as a system, apparatus, method, program, or storage medium. More specifically, the present invention is applicable to a system including a plurality of devices or an apparatus including a single device.

The present invention is also achieved even by supplying a program which implements the functions of the above-described embodiments to the system or apparatus directly or from a remote site and causing the computer of the system or apparatus to read out and execute the supplied program codes.

Hence, the program code itself, which is installed in a computer to implement the functional processing of the present invention by the computer, is also incorporated in the technical scope of the present invention. That is, the present invention also incorporates a computer program to implement the functional processing of the present invention.

In this case, the program can take any form such as an object code, a program to be executed by an interpreter, or script data to be supplied to the OS as long as the functions of the program can be obtained.

The recording medium to supply the program includes, e.g., a Floppy® disk, hard disk, optical disk, magneto-optical disk, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory card, ROM, or DVD (DVD-ROM or DVD-R).

As another program supply method, a client computer may be connected to a homepage on the Internet by using a browser in the client computer, and the computer program itself of the present invention or a compressed file containing an automatic install function may be downloaded from the homepage to a recording medium such as a hard disk. The program code contained in the program of the present invention may be divided into a plurality of files, and the files may be downloaded from different homepages. That is, a WWW server which causes a plurality of users to download a program file that causes a computer to implement the functional processing of the present invention is also incorporated in the claims of the present invention.

The following supply form is also available. The program of the present invention may be encrypted, stored in a storage medium such as a CD-ROM, and distributed to users. Any user who satisfies predetermined conditions may be allowed to download key information for decryption from a homepage through the Internet, execute the encrypted program by using the key information, and install the program in the computer.

The functions of the above-described embodiments are implemented not only when the readout program is executed by the computer but also when, e.g., the OS running on the computer performs part or all of actual processing on the basis of the instructions of the program.

The functions of the above-described embodiments are also implemented when the program read out from the recording medium is written in the memory of a function expansion board inserted into the computer or a function expansion unit connected to the computer, and the CPU of the function expansion board or function expansion unit performs part or all of actual processing on the basis of the instructions of the program.

As described above, according to the embodiments, a technique of recognizing a face with high accuracy under various image sensing conditions can be provided.

While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

This application claims the benefit of Japanese Patent Application No. 2005-278782, filed Sep. 26, 2005, and Japanese Patent Application No. 2005-278783, filed Sep. 26, 2005, which are hereby incorporated by reference herein in their entirety.

What is claimed is:
 1. An information processing apparatus comprising: an input unit adapted to input image data containing a face; a first detection unit adapted to detect, from the image data, a position of a specific part of the face; a second detection unit adapted to detect a feature point of the face from the image data on the basis of the detected position of the specific part; and a determination unit adapted to determine facial expression of the face on the basis of the detected feature point, wherein said second detection unit has higher detection accuracy than detection accuracy of said first detection unit, and said first detection unit is robust to a variation in a detection target. 